What is (tagged) text?

Author

  • Frank Wm Tompa
Abstract

In working on the New OED project, we, like many other researchers, have wrestled with large, intricate bodies of text. Based on this exposure, we have begun to investigate the similarities and differences between managing conventional business data and managing reference text data. The paper begins with the claim that text can support complex models of the real world that cannot be captured more formally. Thus important information resources must be held as text, but the very absence of a formal model makes it difficult to identify the structures present in a text. A common text structuring technique is descriptive markup, which introduces tags into a text stream. We present three views of tagged text: one based on tags as text, one on arbitrarily interleaved tags with text, and one on constrained tag placement in the text. Throughout the discussion, examples are drawn from our experience with the OED.

1. Text as a model

The role of a database is to model an enterprise, so that when queries are posed against the database, information can be obtained about the enterprise. Similarly a reference text is consulted to obtain information about aspects of our collective knowledge as modelled by its contents. A reference text database must capture the information of the reference materials, so that it can provide answers to queries for information about the same collective knowledge.

Unfortunately working with a reference text database is not as simple as working with a conventional database, because the content is not formally constrained: modelling with text does not distinguish which aspects of perceived reality are captured in the database and which are omitted [Kent78]. Whereas conventional database design begins with a business analysis to determine the users' requirements, followed by a synthesis of a model to capture all the relevant features in a highly structured form, text demands more editorial freedom. Consider the following two definitions from the OED2:





Publication date: 1989